The journey of data visualization begins with four fundamental questions:
Categorical data represents discrete groups or categories. Common visualization methods include:
# Creating a simple bar plot
library(ggplot2)
# Sample data
categories <- c("A", "B", "C", "D")
values <- c(23, 45, 12, 78)
data <- data.frame(categories, values)
# Basic bar plot
ggplot(data, aes(x = categories, y = values)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Simple Bar Plot",
       x = "Categories",
       y = "Values")
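The bar plot above passes `stat = "identity"` because `values` already holds one number per category. When the data arrive as raw observations instead, the counts have to be computed first. The following is a minimal sketch of that step, not a prescribed workflow: the `responses` vector and the `n` column name are made up for illustration, and the plotting call is skipped when ggplot2 is not installed.

```r
# Hypothetical raw observations: one entry per observed item
responses <- c("C", "B", "C", "A", "B", "C", "D", "C", "A", "B")

# Tabulate to one row per category; `n` is an illustrative column name
counts <- as.data.frame(table(category = responses), responseName = "n")

# Order the bars from most to least frequent instead of alphabetically
counts$category <- reorder(counts$category, -counts$n)
print(counts)

# Plot the pre-computed counts; geom_col() draws bar heights from y directly
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(counts, aes(x = category, y = n)) +
    geom_col(fill = "steelblue") +
    theme_minimal() +
    labs(title = "Counts per Category", x = "Category", y = "Count")
  print(p)
}
```

Sorting by frequency is a design choice that suits unordered categories; for ordered categories (for example age bands) the natural order is usually easier to read.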
# Stacked bar chart example
library(ggplot2)
# Create sample data for stacked bar chart
stacked_data <- data.frame(
  year = rep(2020:2023, each = 3),
  category = rep(c("Product A", "Product B", "Product C"), times = 4),
  sales = c(
    30, 20, 15, # 2020
    35, 25, 20, # 2021
    40, 30, 25, # 2022
    45, 35, 30  # 2023
  )
)
# Create stacked bar chart
ggplot(stacked_data, aes(x = factor(year), y = sales, fill = category)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  labs(
    title = "Sales by Product Category Over Time",
    x = "Year",
    y = "Sales",
    fill = "Product Category"
  )
# To create a 100% stacked bar chart (proportions), just change position to "fill"
ggplot(stacked_data, aes(x = factor(year), y = sales, fill = category)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  labs(
    title = "Product Category Distribution Over Time",
    x = "Year",
    y = "Percentage",
    fill = "Product Category"
  )
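With `position = "fill"`, each bar is rescaled so that its segments sum to 1. The sketch below reproduces that arithmetic by hand in base R, using the same sales figures as the example above; the `share` column name is an illustrative choice, not something ggplot2 creates.

```r
# Same sales figures as the stacked bar example
stacked_data <- data.frame(
  year = rep(2020:2023, each = 3),
  category = rep(c("Product A", "Product B", "Product C"), times = 4),
  sales = c(30, 20, 15, 35, 25, 20, 40, 30, 25, 45, 35, 30)
)

# Share of each category within its year: ave() applies sum() per year
# and returns a vector aligned with the original rows
stacked_data$share <- stacked_data$sales /
  ave(stacked_data$sales, stacked_data$year, FUN = sum)
print(stacked_data)

# Check: the shares within each year add up to 1
year_totals <- tapply(stacked_data$share, stacked_data$year, sum)
print(year_totals)
```

Computing the shares explicitly is useful when the percentages also need to appear as labels or in a table next to the chart.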
# Treemap example with hierarchical structure
library(ggplot2)
library(treemapify)
# Create sample hierarchical data
treemap_data <- data.frame(
  department = rep(c("Sales", "Marketing", "R&D"), times = c(6, 4, 5)),
  team = c(
    # Sales teams
    rep("North", 2), rep("South", 2), rep("West", 2),
    # Marketing teams
    rep("Digital", 2), rep("Traditional", 2),
    # R&D teams
    rep("Product A", 3), rep("Product B", 2)
  ),
  project = c(
    # Sales projects
    "Corporate", "SMB",
    "Corporate", "SMB",
    "Corporate", "SMB",
    # Marketing projects
    "Social", "Email",
    "Print", "TV",
    # R&D projects
    "Research", "Development", "Testing",
    "Research", "Development"
  ),
  value = c(
    # Sales values
    250, 150,
    200, 180,
    220, 160,
    # Marketing values
    120, 90,
    100, 80,
    # R&D values
    150, 180, 120,
    140, 160
  )
)
# Create hierarchical treemap
ggplot(treemap_data,
       aes(area = value,
           fill = department,
           subgroup = team,
           label = project)) +
  geom_treemap() +
  geom_treemap_subgroup_border(colour = "white", size = 2) +
  geom_treemap_subgroup_text(place = "centre",
                             grow = TRUE,
                             alpha = 0.5,
                             colour = "black",
                             fontface = "bold") +
  geom_treemap_text(colour = "white",
                    place = "centre",
                    size = 10) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Company Structure and Project Distribution",
       subtitle = "By Department, Team, and Project",
       fill = "Department")
# Mosaic plot example
library(vcd)
# Sample data for mosaic plot
data <- data.frame(
  gender = rep(c("M", "F"), each = 100),
  age_group = rep(c("Young", "Middle", "Old"), times = c(70, 80, 50)),
  education = rep(c("High", "Medium", "Low"), times = c(60, 90, 50))
)
mosaic(~ gender + age_group + education,
       data = data,
       shade = TRUE,
       legend = TRUE)
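A mosaic plot is a picture of a contingency table: the area of each tile is proportional to the count in the corresponding cell. As a rough sketch of what sits behind the plot, the counts can be inspected directly with base R. The data below repeat the sample data above; the name `survey_data` is only an illustrative rename.

```r
# Same sample data as the mosaic example, under an illustrative name
survey_data <- data.frame(
  gender = rep(c("M", "F"), each = 100),
  age_group = rep(c("Young", "Middle", "Old"), times = c(70, 80, 50)),
  education = rep(c("High", "Medium", "Low"), times = c(60, 90, 50))
)

# Two-way contingency table: each cell is the count behind one mosaic tile
tab <- xtabs(~ gender + age_group, data = survey_data)
print(tab)

# Row proportions: the age-group split within each gender
print(round(prop.table(tab, margin = 1), 2))
```

Because the sample data are built from consecutive blocks with `rep()`, some combinations never occur (there are no young women, for example), which shows up as empty cells in the table and as missing tiles in the mosaic.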
Time series data shows how variables change over time. Visualization options include:
# Creating a line chart with time series data
library(ggplot2)
# Generate sample time series data (seeded so the plot is reproducible)
set.seed(123)
dates <- seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "month")
values <- rnorm(12, mean = 100, sd = 10)
ts_data <- data.frame(date = dates, value = values)
# Time series plot
ggplot(ts_data, aes(x = date, y = value)) +
  geom_line(color = "darkblue") +
  geom_point() +
  theme_minimal() +
  labs(title = "Monthly Values Over Time",
       x = "Date",
       y = "Value")
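Month-to-month noise can hide the overall direction of a series, so one option is to overlay a smoothed line on the raw values. The sketch below assumes a centred 3-month moving average; the window width and the `smooth` column name are arbitrary choices for illustration, and the plot is skipped when ggplot2 is not installed.

```r
# Same kind of sample data as the line chart example
set.seed(123)
dates <- seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "month")
ts_data <- data.frame(date = dates, value = rnorm(12, mean = 100, sd = 10))

# Centred 3-month moving average; the first and last months have no
# complete window, so stats::filter() returns NA for them
ts_data$smooth <- as.numeric(
  stats::filter(ts_data$value, rep(1 / 3, 3), sides = 2)
)
print(ts_data)

# Raw series in grey, smoothed series on top
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(ts_data, aes(x = date)) +
    geom_line(aes(y = value), color = "grey60") +
    geom_line(aes(y = smooth), color = "darkblue", na.rm = TRUE) +
    theme_minimal() +
    labs(title = "Monthly Values with 3-Month Moving Average",
         x = "Date", y = "Value")
  print(p)
}
```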
Spatial data visualization helps understand geographical patterns through:
“Spatial data is a lot like categorical data, but with a geographic component. You should know the range of the data to start with, and then look for regional patterns. Are there higher or lower values clustered in a certain area of a country or continent? Because a single value only tells you about a small part about a region filled with people, think about what a pattern implies and look to other datasets to verify hunches.” - Yau, Data Points, 176
# Creating a simple choropleth map
library(maps)
library(ggplot2)
# Get US states map data
states_map <- map_data("state")
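# Create sample data (random placeholder values, seeded for reproducibility)
set.seed(123)
state_data <- data.frame(
  state = unique(states_map$region),
  value = runif(length(unique(states_map$region)), 0, 100)
)
# Plot choropleth map
ggplot() +
  # Base layer: state outlines (geom_map() only needs map_id)
  geom_map(data = states_map, map = states_map,
           aes(map_id = region),
           fill = "white", color = "grey50") +
  # Data layer: fill each state by its value
  geom_map(data = state_data, map = states_map,
           aes(fill = value, map_id = state),
           color = "grey50") +
  # geom_map() does not set axis limits, so take them from the map coordinates
  expand_limits(x = states_map$long, y = states_map$lat) +
  coord_quickmap() +
  scale_fill_viridis_c() +
  theme_void() +
  labs(title = "US States Choropleth Map",
       fill = "Value")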
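Yau's advice to know the range first and then look for regional patterns can be followed numerically before any map is drawn. The sketch below uses only base R and its built-in `state.name` and `state.region` vectors; the values are random placeholders, and `state_values` is a hypothetical name.

```r
# Hypothetical data: one random placeholder value per US state
set.seed(42)
state_values <- data.frame(
  state = tolower(state.name),  # built-in vector of the 50 state names
  region = state.region,        # built-in factor with four census regions
  value = runif(length(state.name), 0, 100)
)

# Step 1: know the range before choosing a colour scale
value_range <- range(state_values$value)
print(value_range)

# Step 2: a coarse look at regional patterns via per-region means
region_means <- aggregate(value ~ region, data = state_values, FUN = mean)
print(region_means)
```

With random values the regional means will look similar; with real data, a region whose mean stands apart is a prompt to look at the map more closely and to check other datasets, as the quote suggests.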
When dealing with multiple variables, consider:
“There are a lot of visualization methods that help you explore various aspects of your data, whether it is categories, time, space, or a combination of these. You can visualize the data all at once, but you can also make use of simpler, more straightforward views, which can help extract relationships. Sometimes the relationships are straightforward between two variables, but usually the relationship is complex, especially when you introduce more than two variables. Don’t make assumptions as you explore relationships, and keep in mind there are variables not captured in the data that might contribute to changes. Finally, when it comes to correlation and causation, you need to take in all the context you can before you assign the latter.” - Yau, Data Points, 189
# Creating a scatter plot with multiple variables
library(ggplot2)
# Generate sample data
set.seed(123)
n <- 100
data <- data.frame(
  x = rnorm(n),
  y = rnorm(n),
  group = factor(sample(1:3, n, replace = TRUE)),
  size = runif(n, 1, 10)
)
# Multi-variable scatter plot
ggplot(data, aes(x = x, y = y, color = group, size = size)) +
  geom_point(alpha = 0.6) +
  theme_minimal() +
  labs(title = "Multi-variable Scatter Plot",
       x = "X Variable",
       y = "Y Variable")
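The warning about variables that complicate a relationship can be made concrete with simulated data. In the sketch below, which is constructed purely for illustration, `x` and `y` rise together across groups but move in opposite directions within each group, so the overall correlation and the per-group correlations have opposite signs. All names and parameters (`relation_data`, `centre`, the group offsets) are assumptions chosen to produce that effect.

```r
# Simulated data: groups differ in level, and within each group
# y falls as x rises
set.seed(123)
n_per_group <- 50
group <- rep(c("g1", "g2", "g3"), each = n_per_group)
centre <- rep(c(5, 10, 15), each = n_per_group)  # group-level offset
e <- rnorm(3 * n_per_group)                      # within-group variation
relation_data <- data.frame(
  group = group,
  x = centre + e,
  y = centre - e + rnorm(3 * n_per_group, sd = 0.3)
)

# Correlation ignoring the grouping variable
overall_cor <- cor(relation_data$x, relation_data$y)
print(overall_cor)

# Correlation within each group
group_cor <- sapply(split(relation_data, relation_data$group),
                    function(d) cor(d$x, d$y))
print(group_cor)
```

Mapping `group` to colour, as in the scatter plot above, or adding `facet_wrap(~ group)` gives the simpler per-group views that make this kind of reversal visible.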
To visualize data distributions, use:
“Regardless of the type of visualization you use to explore distributions, look for peaks and valleys, range, and the spread of your data, which tell you a lot more than just the mean and median would. The visual analysis of raw data and the variation in between the summary statistics are almost always more interesting, so make use of the opportunity when you get it.” - Yau, Data Points, 199
# Creating multiple distribution plots
library(ggplot2)
# Generate sample data (seeded so the plot is reproducible)
set.seed(123)
data <- data.frame(
  group = rep(c("A", "B", "C"), each = 100),
  value = c(rnorm(100), rnorm(100, 1), rnorm(100, 2))
)
# Create violin plot with boxplot inside
ggplot(data, aes(x = group, y = value, fill = group)) +
  geom_violin(alpha = 0.5) +
  geom_boxplot(width = 0.2, alpha = 0.8) +
  theme_minimal() +
  labs(title = "Distribution Comparison",
       x = "Group",
       y = "Value")
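To illustrate why peaks and valleys say more than the mean alone, the sketch below simulates two samples with similar means but different shapes: one with a single peak and one with two peaks and a valley between them. The sample names, sizes, and parameters are assumptions made for the illustration, and the histogram is skipped when ggplot2 is not installed.

```r
# Two simulated samples centred near 0 with very different shapes
set.seed(123)
shape_data <- data.frame(
  sample = rep(c("unimodal", "bimodal"), each = 200),
  value = c(
    rnorm(200, mean = 0, sd = 2),                       # one broad peak
    rnorm(100, mean = -2, sd = 0.5),                    # left peak
    rnorm(100, mean = 2, sd = 0.5)                      # right peak
  )
)

# The means alone suggest the samples are alike
means <- tapply(shape_data$value, shape_data$sample, mean)
print(round(means, 2))

# Counting observations near the centre reveals the valley
central <- tapply(abs(shape_data$value) < 1, shape_data$sample, sum)
print(central)

# Histograms make the difference in shape obvious
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(shape_data, aes(x = value, fill = sample)) +
    geom_histogram(bins = 30, alpha = 0.7) +
    facet_wrap(~ sample, ncol = 1) +
    theme_minimal() +
    labs(title = "Similar Means, Different Shapes",
         x = "Value", y = "Count")
  print(p)
}
```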